Chapter 4: Classification and clustering

In this exercise I analyze the Boston data set available in the MASS package.

Loading MASS and Boston data set

The Boston data set contains 506 housing observations (rows) and 14 variables (columns).

library(MASS)
data('Boston')

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Visualize and summarize data

Nice examples of corrplot customization: https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

library(corrplot)
library(RColorBrewer)


correlations <- cor(Boston)
round(correlations, digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
colors <- brewer.pal(n = 9, name = "Pastel1")
signf_test <- cor.mtest(Boston, conf.level = .95)

corrplot(correlations, type = 'upper', method = 'ellipse', order = "hclust", col = brewer.pal(n = 8, name = "PiYG"), bg = colors[length(colors)], 
         p.mat = signf_test$p, insig = 'p-value', sig.level = .05, tl.col = "black", tl.srt = 90)

The corrplot above shows negative correlations in pink and positive correlations in green. With method = 'ellipse', the narrower the ellipse, the stronger the correlation. For non-significant correlations (p > 0.05), the p-values are printed instead.

From the graph it can be observed that most of the variables correlate significantly with one another; only a few pairs are not significantly correlated.
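The strongest pairs can also be listed directly from the correlation matrix. A small sketch using the `correlations` matrix computed above (the 0.7 cutoff is an arbitrary choice for illustration):

```r
#List variable pairs with |r| > 0.7, using the upper triangle to avoid duplicates
idx <- which(abs(correlations) > 0.7 & upper.tri(correlations), arr.ind = TRUE)
data.frame(var1 = rownames(correlations)[idx[, 1]],
           var2 = colnames(correlations)[idx[, 2]],
           r    = round(correlations[idx], 2))
```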

Scale variables and create a categorical crime variable

To classify the data accurately, the variables need to be scaled so that each has a mean of 0 and a standard deviation of 1. Since all variables in Boston are numerical, as expected for classification analysis, this can be done directly with scale():
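As a quick check of what scale() does, each column is standardized as z = (x - mean(x)) / sd(x); the crim column verifies this:

```r
#scale() standardizes each column: z = (x - mean(x)) / sd(x)
z_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(z_crim, as.numeric(scale(Boston$crim)))  # TRUE
```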

scaled <- as.data.frame(scale(Boston))
class(scaled)
## [1] "data.frame"
str(scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
summary(scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Next, crim is cut into a categorical variable using its quantiles as break points, so that it can later be used to train a model that predicts the crime rate class of an observation from the other variables.
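One detail worth noting: quantile breaks define half-open intervals (a, b], so without include.lowest = TRUE the minimum value would fall outside the first interval and become NA. A toy example:

```r
x <- c(1, 2, 3, 4, 5)
q <- quantile(x)
cut(x, breaks = q)[1]                         # <NA>: the minimum falls outside (Q0, Q1]
cut(x, breaks = q, include.lowest = TRUE)[1]  # the minimum is binned into [Q0, Q1]
```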

bins <- quantile(scaled$crim)

crime <- cut(scaled$crim, breaks = bins, labels = c('low', 'med_low', 'med_high', 'high'), include.lowest = TRUE)

#count table for each category level
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
#replace original crim with categorical crime variable

scaled <- dplyr::select(scaled, -crim)
scaled <- data.frame(scaled, crime)
head(scaled)
##           zn      indus       chas        nox        rm        age
## 1  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
##        dis        rad        tax    ptratio     black      lstat
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866
## 6 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909
##         medv crime
## 1  0.1595278   low
## 2 -0.1014239   low
## 3  1.3229375   low
## 4  1.1815886   low
## 5  1.4860323   low
## 6  0.6705582   low

Divide data into training and test sets

To evaluate how well the model predicts crime rate, I set aside a small fraction of the data (20%) for testing, so that it is not used for training the model. Below, observations are assigned randomly to the training and test sets.

random_test_rows <- sample(nrow(scaled), size = nrow(scaled) * 0.2)

test_set <- scaled[random_test_rows, ]
train_set <- scaled[-random_test_rows, ]

#Check that the resulting data frames have the expected dimensions
dim(test_set)
## [1] 101  14
dim(train_set)
## [1] 405  14
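Note that the sample() call above is not seeded, so the exact split (and the downstream numbers) will vary between runs. A seeded sketch of the same split (the seed value itself is an arbitrary choice):

```r
set.seed(2017)  # arbitrary seed; any fixed value makes the split reproducible
random_test_rows <- sample(nrow(scaled), size = nrow(scaled) * 0.2)

test_set <- scaled[random_test_rows, ]
train_set <- scaled[-random_test_rows, ]
```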

Fitting linear discriminant analysis for crimes

Fitting a classification model with the lda() function, using crime as the categorical target variable and all other (continuous) variables as predictors.

lda_fit <- lda(crime ~ ., data = train_set)
lda_fit
## Call:
## lda(crime ~ ., data = train_set)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2444444 0.2395062 0.2641975 0.2518519 
## 
## Group means:
##                   zn      indus         chas        nox         rm
## low       0.93615081 -0.9294381 -0.153023000 -0.8707555  0.4081018
## med_low  -0.08852702 -0.3016549 -0.028797094 -0.5630289 -0.1405428
## med_high -0.39828015  0.2742082  0.242805543  0.4433154  0.1626778
## high     -0.48724019  1.0171096 -0.002135914  1.0478514 -0.4429198
##                 age        dis        rad        tax     ptratio
## low      -0.8692298  0.8767904 -0.7022949 -0.7210280 -0.41010585
## med_low  -0.3824166  0.3201944 -0.5497161 -0.4604819 -0.08660373
## med_high  0.4541229 -0.4259312 -0.4022710 -0.2703515 -0.28380026
## high      0.7810925 -0.8372084  1.6382099  1.5141140  0.78087177
##               black       lstat        medv
## low       0.3755321 -0.75032455  0.48044567
## med_low   0.3456466 -0.15985254  0.01941197
## med_high  0.0766059  0.03064053  0.20261327
## high     -0.8956447  0.85714937 -0.64411392
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.10989300  0.678874731 -1.08645694
## indus    0.01434508 -0.390032763  0.26322751
## chas    -0.08190904 -0.045235593  0.07506216
## nox      0.36281943 -0.690378883 -1.29508611
## rm      -0.09637667 -0.179776941 -0.22089666
## age      0.29244889 -0.307623541 -0.18808909
## dis     -0.05077580 -0.303426481  0.17094388
## rad      2.97154503  0.950236058 -0.04910165
## tax      0.01915546 -0.026362292  0.58125932
## ptratio  0.13051316  0.004893263 -0.40919164
## black   -0.15546369  0.006575529  0.16626190
## lstat    0.18610569 -0.216682848  0.33462026
## medv     0.19409637 -0.363610561 -0.19118986
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9400 0.0456 0.0144

Visualization of model

Using a ggsci color palette and the ggord package to visualize lda_fit. To install the ggord package from GitHub, I use the install_github function from the devtools package.

#Convert crime factor levels to numeric to plot in different colors
crime_levels <- as.numeric(train_set$crime)

#I hate the default colors of the plot, so I'm using ggsci package palettes instead
#Good source for color palettes https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/
library(ggsci)
library(devtools)
install_github("fawda123/ggord")
library(ggord)

cols <- pal_aaas()(4)


#Plot with nicer colors

ggord(lda_fit, train_set$crime, poly = FALSE, arrow=.3, veclsz = .5, vec_ext = 4, size=1, cols = cols)

With so many variables in the model, the arrows look a bit messy. Still, it is easy to see which variables affect the classification most (zn, rad, nox). The ggord package was very nice and easy to use. The model does not classify crime rates perfectly, but it does a pretty good job of distinguishing the high crime rate class from the others in train_set.

Test model

First, save the correct crime classes of the test data to a variable, then remove the crime variable from the test data before predicting the classes with lda_fit.

#Save correct classes to variable
correct_classes <- test_set$crime

#Remove classes from test data
test_set <- dplyr::select(test_set, -crime)

#Predict classes with model
lda_predict <- predict(lda_fit, newdata = test_set)

#Cross-tabulate correct vs. predicted classes to observe model accuracy

table(correct = correct_classes, predicted = lda_predict$class)
##           predicted
## correct    low med_low med_high high
##   low       17      10        1    0
##   med_low    2      20        7    0
##   med_high   1       9        9    0
##   high       0       0        0   25
#Calculate percentage of right predictions on test data
percent_correct <- 100 * mean(lda_predict$class==correct_classes)
percent_correct <- round(percent_correct, digits = 0)

percent_correct
## [1] 70

The analysis above shows that the model predicted the crime class correctly for 70% of the test_set observations. Prediction accuracy is 100% for the high crime rate class but lower for the other classes; the worst is med_high, where more than half of the test observations were misclassified.
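The per-class accuracies can also be computed directly from the confusion matrix diagonal, continuing from the objects fitted above:

```r
#Per-class accuracy: correct predictions on the diagonal divided by row totals
conf <- table(correct = correct_classes, predicted = lda_predict$class)
round(diag(conf) / rowSums(conf), 2)

#Overall accuracy is the diagonal sum over the grand total
sum(diag(conf)) / sum(conf)
```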

K-means clustering

I reload the Boston data set and scale it for clustering with k-means. First I calculate the Euclidean distances between observations:

data(Boston)
scaled_kmeans <- as.data.frame(scale(Boston))
eu_dist <- dist(scaled_kmeans)
summary(eu_dist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
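As a sanity check, dist() computes all pairwise Euclidean distances, which can be verified by hand for one pair of observations:

```r
#Compare dist() output with the Euclidean distance computed manually
d_mat <- as.matrix(eu_dist)
manual <- sqrt(sum((scaled_kmeans[1, ] - scaled_kmeans[2, ])^2))
all.equal(manual, d_mat[1, 2], check.attributes = FALSE)  # TRUE
```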

Next I run k-means clustering with a fixed seed for reproducibility:

library(ggplot2)
set.seed(123)

#Setting the maximum number of clusters to try
max_clusters <- 10

#Finding the optimal number of clusters with the so-called elbow method
twcss <- sapply(1:max_clusters, function(k) {kmeans(scaled_kmeans, k)$tot.withinss})
qplot(x = 1:max_clusters, y = twcss, geom = 'line')

I decided to use 3 clusters: beyond that point the twcss curve flattens out, and I was not satisfied with how the alternatives looked (I tested 2, 4 and 8 clusters).

set.seed(123)
km <- kmeans(scaled_kmeans, centers = 3)

cols <- pal_futurama()(3)
cols_clusters <- cols[km$cluster]
pairs(scaled_kmeans, col = cols_clusters)

Bonus: LDA using k-means clusters

Fitting lda() using the k-means cluster assignments as the target variable and all variables in the data set as explanatory variables. The data used is the scaled Boston data.

lda_kmeans <- lda(km$cluster ~ ., data = scaled_kmeans)
lda_kmeans
## Call:
## lda(km$cluster ~ ., data = scaled_kmeans)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2806324 0.3992095 0.3201581 
## 
## Group means:
##         crim         zn        indus        chas         nox         rm
## 1  0.9693718 -0.4872402  1.074440092 -0.02279455  1.04197430 -0.4146077
## 2 -0.3549295 -0.4039269  0.009294842  0.11748284  0.01531993 -0.2547135
## 3 -0.4071299  0.9307491 -0.953383032 -0.12651054 -0.93243813  0.6810272
##          age        dis        rad        tax     ptratio      black
## 1  0.7666895 -0.8346743  1.5010821  1.4852884  0.73584205 -0.7605477
## 2  0.3096462 -0.2267757 -0.5759279 -0.4964651 -0.09219308  0.2473725
## 3 -1.0581385  1.0143978 -0.5976310 -0.6828704 -0.53004055  0.3582008
##         lstat       medv
## 1  0.85963373 -0.6874933
## 2  0.09168925 -0.1052456
## 3 -0.86783467  0.7338497
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim     0.03654114  0.20373943
## zn      -0.08346821  0.34784463
## indus   -0.32262409 -0.12105014
## chas    -0.04761479 -0.13327215
## nox     -0.13026254  0.15610984
## rm       0.13267423  0.44058946
## age     -0.11936644 -0.84880847
## dis      0.23454618  0.58819732
## rad     -1.96894437  0.57933028
## tax     -1.10861600  0.53984421
## ptratio -0.13087741 -0.02004405
## black    0.15432491 -0.06106305
## lstat   -0.14002173  0.14786473
## medv     0.02559139  0.37307811
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8999 0.1001

To visualize the fitted model, I again use the ggord function. km$cluster needs to be converted with factor() because ggord does not accept a numeric grouping variable. Using the same colors as before.

cols <- pal_aaas()(3)

ggord(lda_kmeans, factor(km$cluster), poly = FALSE, arrow=.3, veclsz = .5, vec_ext = 4, size=1, cols = cols)

All 3 clusters can be separated quite nicely from each other, although only cluster 2 is clearly distinct from the other two. The clusters still separate better here than the crime rate classes did as target classes earlier. In this model, the most influential variables are tax, rad and age. However, with a different seed each run of k-means can produce different clusters, which makes interpreting the meaning of individual clusters quite difficult.
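To confirm the role of the seed: with the same seed, kmeans() returns identical cluster assignments, so differences between runs come only from the random initialization:

```r
set.seed(123)
a <- kmeans(scaled_kmeans, centers = 3)$cluster
set.seed(123)
b <- kmeans(scaled_kmeans, centers = 3)$cluster
identical(a, b)  # TRUE: same seed, same partition
```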

Below, LDA is fitted for k-means with 6 clusters. There, crim and black are the most influential variables, but not all clusters separate nicely along LD1 and LD2, which together explain only around 70% of the between-group variance.

#6 clusters

set.seed(123)
km6 <- kmeans(scaled_kmeans, centers = 6)

lda_kmeans6 <- lda(km6$cluster ~ ., data = scaled_kmeans)

cols <- pal_aaas()(6)

ggord(lda_kmeans6, factor(km6$cluster), poly = FALSE, arrow=.3, veclsz = .5, vec_ext = 4, size=1, cols = cols)

Super-bonus

model_predictors <- dplyr::select(train_set, -crime)
# check the dimensions
dim(model_predictors)
## [1] 405  13
dim(lda_fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_fit$scaling
matrix_product <- as.data.frame(matrix_product)

#Next, install and load the plotly package, then create a 3D plot (Cool!) of the columns of the matrix product with the code below.
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train_set$crime)